What if you could beatbox* into an app that could process the audio, determine which sounds are which, and produce a matching audio track of real drum noises?
From Siri to Google Assistant, a handful of applications have explored and mastered decoding human speech from audio. The push for these kinds of excellent speech processing algorithms has resulted in a wealth of knowledge on the topic, from blog posts to academic papers.
Beatbot uses these documented techniques along with a few novel strategies to analyze audio and replace beatbox sounds with drum sounds.
Here are a few examples of what it can do.
* Beatboxing is vocal percussion (i.e.: "pft tss kah tss pft tss kah")
import os
import IPython
import librosa
from utils import plot_utils
%matplotlib inline
# Display helper
def display_before_and_after(label, before_file, after_file):
    # Plot the 'before' audio
    audio, _ = librosa.load(before_file, mono=True)
    plot_utils.plot(
        y=audio,
        title=label + ' (before, after)',
        xlabel='Time (samples)',
        ylabel='Amplitude',
        figsize=(20, 2)
    )
    # Display audio players
    IPython.display.display(IPython.display.Audio(before_file))
    IPython.display.display(IPython.display.Audio(after_file))
display_before_and_after(
'Example 1 (easy)',
'assets/input_audio/beatbox_1.wav',
'assets/output_audio/drum_track_1.wav'
)
display_before_and_after(
'Example 2 (easy)',
'assets/input_audio/beatbox_2.wav',
'assets/output_audio/drum_track_2.wav'
)
display_before_and_after(
'Example 3 (hard)',
'assets/input_audio/beatbox_3.wav',
'assets/output_audio/drum_track_3.wav'
)
display_before_and_after(
'Example 4 (hard)',
'assets/input_audio/beatbox_4.wav',
'assets/output_audio/drum_track_4.wav'
)
Pretty neat! Now let's see how Beatbot works behind the scenes.
Beatbot operates in three steps.
- First, it locates all of the beatbox sounds in the beatbox audio
- Next, it classifies which sounds are which
- And finally, it replaces the beatbox sounds with similar sounding drum sounds
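To make the flow concrete, the three steps can be sketched end-to-end on a toy 1-D signal. These helper functions are illustrative stand-ins, not Beatbot's actual code: they use a naive amplitude threshold and duration-based labels in place of the spectrogram and clustering techniques described below, but the shape of the pipeline is the same.

```python
import numpy as np

def locate_beats(signal, threshold=0.5):
    # Step 1: find contiguous regions where the signal is loud
    loud = (np.abs(signal) > threshold).astype(int)
    edges = np.diff(loud)
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    return list(zip(starts, ends))

def classify_beats(spans):
    # Step 2: label each beat (here, crudely, by its duration)
    durations = [end - start for start, end in spans]
    shortest = min(durations)
    return [0 if d == shortest else 1 for d in durations]

def build_drum_track(length, spans, labels, drum_sounds):
    # Step 3: overlay the matching drum sound at each beat's start
    track = np.zeros(length)
    for (start, _), label in zip(spans, labels):
        drum = drum_sounds[label][:length - start]
        track[start:start + len(drum)] += drum
    return track

signal = np.zeros(20)
signal[3:5] = 1.0    # a short beat
signal[10:14] = 1.0  # a longer beat
spans = locate_beats(signal)    # [(3, 5), (10, 14)]
labels = classify_beats(spans)  # [0, 1]
drums = [np.array([0.5, 0.5]), np.array([0.9, 0.9, 0.9])]
track = build_drum_track(len(signal), spans, labels, drums)
```

The rest of this notebook replaces each of these toy steps with something that works on real audio.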
The first step in Beatbot is to locate the start and end of each beatbox sound.
Let's start out by taking another look at the input audio from Example 1.
BEATBOX_AUDIO_FILE = 'assets/input_audio/beatbox_1.wav'
# Load audio
audio, sample_rate = librosa.load(BEATBOX_AUDIO_FILE, mono=True)
# Plot
plot_utils.plot(y=audio, title='Beatbox audio', xlabel='Time (samples)', ylabel='Amplitude')
# Display audio player
IPython.display.Audio(BEATBOX_AUDIO_FILE)
Visually, we can already begin to see where the beats are located, but the raw waveform isn't good enough. Certain loud noises don't register as large amplitude spikes while certain quiet noises do.
To get a cleaner representation of the audio's loudness, we'll start by taking a look at the frequencies of the audio over time.
We'll do this by applying the "Fourier transform" to small, overlapping windows of our audio wave.
import numpy as np
# Windows of audio 23ms long with 12ms of overlap
WINDOW_SIZE = 512
WINDOW_OVERLAP = 256
# Take the 'short time' Fourier transform of the audio signal
spectrogram = librosa.stft(y=audio, n_fft=WINDOW_SIZE, win_length=WINDOW_SIZE, hop_length=WINDOW_OVERLAP)
# Convert imaginary numbers to real numbers
spectrogram = np.abs(spectrogram)
# Convert from Hz to Mel frequency scale
mel_basis = librosa.filters.mel(sr=sample_rate, n_fft=WINDOW_SIZE)
spectrogram = np.dot(mel_basis, spectrogram)
# Convert from amplitude to decibels
spectrogram = librosa.amplitude_to_db(spectrogram)
# Plot
plot_utils.heatmap(data=spectrogram, title='Spectrogram', xlabel='Time (windows)', ylabel='Frequencies')
In this visualization, known as a spectrogram, we can clearly see where each beat is located. Yellower colors indicate loudness while bluer colors indicate quietness. Positions near the top represent high frequencies, while lower positions represent low frequencies.
To determine the loudness of our audio at any point, we just need to add up all the yellow energy in a given time column.
Check out the code for more specifics on how we calculated this spectrogram.
# Sum the spectrogram by column
volume = np.sum(spectrogram, axis=0)
# Plot
plot_utils.plot(volume, title='Audio volume', xlabel='Time (windows)', ylabel='Volume (decibels)')
Summing our spectrogram by column gives us the volume of the beatbox track over time.
We now can see clear peaks at each beatbox sound.
We'll now use a simple peak finding algorithm to determine how many peaks exist at each peak prominence value from zero to the largest peak prominence.
This approach will give us the flexibility to use Beatbot on tracks with varying volume.
import scipy.signal
import matplotlib.pyplot as plt
MIN_PROMINENCE = 0
MAX_PROMINENCE = np.max(volume) - np.min(volume)
NUM_PROMINENCES = 1000
# Create an array of prominence values
prominences = np.linspace(MIN_PROMINENCE, MAX_PROMINENCE, NUM_PROMINENCES)
# For each prominence value, calculate the # of peaks of at least that prominence
num_peaks = []
for prominence in prominences:
    peak_data = scipy.signal.find_peaks(volume, prominence=prominence)
    num_peaks.append(len(peak_data[0]))
# Calculate the most frequent peak quantity
most_frequent_num_peaks = np.argmax(np.bincount(num_peaks))
#Plot
plot_utils.plot(
x=prominences,
y=num_peaks,
title='Number of peaks per prominence',
xlabel='Minimum peak prominence (decibels)',
ylabel='Number of peaks',
show_yticks=True,
show=False
)
line = plt.axhline(most_frequent_num_peaks, c='green', ls=':')
plt.legend([line], ['Most frequent number of peaks = ' + str(most_frequent_num_peaks)], fontsize=14)
plt.show()
The above graph shows how the number of peaks located changes as the required peak prominence is set at different values between zero and our maximum.
In the middle of this graph, there is an unusually long flat section where changing the required peak prominence value does not change the number of peaks located.
The number of peaks along this flat zone corresponds to the most significant peaks in the signal, so that's the count we'll look for in our volume signal.
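In code terms, the longest flat section is simply the most frequent value in the list of peak counts, which is why the snippet above takes the mode with np.bincount. A toy example:

```python
import numpy as np

# Toy peak counts as the prominence threshold sweeps upward:
# small bumps disappear quickly, but the 3 real peaks persist
# across a wide range of thresholds before vanishing too
num_peaks = [9, 7, 5, 3, 3, 3, 3, 3, 2, 1, 0]

# The mode of the counts identifies the long plateau
most_frequent = np.argmax(np.bincount(num_peaks))
print(most_frequent)  # 3
```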
# Define the start and end of each peak as the points where it crosses 70% of its prominence
PROMINENCE_MEASURING_POINT = 0.7
# Get a prominence value that yields the most frequent number of peaks
prominence_index = np.where(np.array(num_peaks) == most_frequent_num_peaks)[0][0]
prominence = prominences[prominence_index]
# Calculate starts and ends of each peak
peak_data = scipy.signal.find_peaks(volume, prominence=prominence, width=0, rel_height=PROMINENCE_MEASURING_POINT)
peak_starts = peak_data[1]['left_ips']
peak_ends = peak_data[1]['right_ips']
# Plot
plot_utils.plot(volume, title='Audio volume', xlabel='Time (windows)', ylabel='Volume (decibels)', show=False)
for peak_start, peak_end in zip(peak_starts, peak_ends):
    beat_start = plt.axvline(peak_start, c='green', alpha=0.5, ls='--')
    beat_end = plt.axvline(peak_end, c='red', alpha=0.5, ls='--')
plt.legend([beat_start, beat_end], ['Beat start', 'Beat end'], fontsize=14, framealpha=1, loc='upper right')
plt.show()
Having determined the number of peaks in the volume signal, we obtain the start and end locations of each peak by looking for the points to the left and right of each peak's tip which lie at 70% of that peak's total prominence.
The value '70%' was chosen through trial and error.
# Convert peak starts and ends from window indices to sample indices
beat_starts = librosa.frames_to_samples(peak_starts, hop_length=WINDOW_OVERLAP)
beat_ends = librosa.frames_to_samples(peak_ends, hop_length=WINDOW_OVERLAP)
# Plot
plot_utils.plot(y=audio, title='Beatbox audio', xlabel='Time (samples)', ylabel='Amplitude', show=False)
for beat_start, beat_end in zip(beat_starts, beat_ends):
    start_line = plt.axvline(beat_start, c='green', alpha=0.5, ls='--')
    end_line = plt.axvline(beat_end, c='red', alpha=0.5, ls='--')
plt.legend([start_line, end_line], ['Beat start', 'Beat end'], fontsize=14, framealpha=1)
plt.show()
And finally, here are our beats' starts and ends overlaid onto the original audio signal.
Now that we have the locations of all of the beats in the track, we need to determine which beats are which.
To start off, let's look at our beats.
from utils import multi_plot_utils
# Extract beats' audio
beats = []
for beat_start, beat_end in zip(beat_starts, beat_ends):
    beat = audio[beat_start:beat_end]
    beats.append(beat)
# Plot
multi_plot_utils.multi_plot(beats, title='Beats')
# Also here's the audio if you want to listen again
IPython.display.Audio(BEATBOX_AUDIO_FILE)
Listening to the audio again, our expected beat pattern should be:
0 1 1 1 2 1 1 1 1 1 0 1 2 1 1 1 0 1 1 1 2 1 1 1 1 1 0 1 2 1 1 1 3
Where:
- 0 = "pft"
- 1 = "tss"
- 2 = "khh"
- 3 = Unintentional knock
Let's visualize this.
EXPECTED_PATTERN = [0, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 0, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 3]
# Plot
multi_plot_utils.multi_plot(beats, title='Expected beat classification', pattern=EXPECTED_PATTERN)
With this goal in mind, our first task in beat classification is to featurize our beats in a way that lets us compare them to one another. A proven audio featurization popular in speech processing is "Mel-frequency cepstral coefficients" (MFCCs).
MFCCs are feature vectors that are created in three steps:
- The input audio is windowed and transformed into its frequency components via the Fourier transform
- Mel coefficients are extracted from each set of frequencies in time (the Mel scale is essentially the human hearing scale)
- These values are compressed, usually using the discrete cosine transform
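To make the Mel scale concrete, here is the common HTK-style conversion formula (one of a few variants in use; librosa defaults to a slightly different 'Slaney' formula). It's roughly linear below 1 kHz and logarithmic above, mirroring how pitch perception compresses at high frequencies:

```python
import math

def hz_to_mel(f_hz):
    # HTK-style Mel conversion
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# 1000 Hz lands at roughly 1000 mel by convention
print(round(hz_to_mel(1000)))  # 1000

# The same 4 kHz gap shrinks on the Mel scale at higher frequencies,
# matching how we perceive pitch differences
low_gap = hz_to_mel(4000) - hz_to_mel(0)
high_gap = hz_to_mel(8000) - hz_to_mel(4000)
print(low_gap > high_gap)  # True
```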
Now let's extract some MFCCs from our beats.
# From each beat, create a set of MFCCs
mfccs = []
for beat in beats:
    mfcc = librosa.feature.mfcc(
        y=beat,
        sr=sample_rate,
        win_length=WINDOW_SIZE,
        hop_length=WINDOW_OVERLAP,
        n_fft=1024
    )
    mfccs.append(mfcc)
# Plot
plot_utils.heatmap(mfccs[0], figsize=(3, 4), title='MFCCs of first beat', xlabel='Time (windows)', ylabel='Features')
multi_plot_utils.multi_heatmap(mfccs, title='All beats\' MFCCs')
These all kind of look the same to us. That's fine though, because the differences are large enough for the computer to tell them apart.
Now we'll compare each feature set to every other feature set using Dynamic Time Warping (DTW). DTW is a strategy that allows us to compare feature sets of different lengths.

The result of each comparison is a distance value that represents how similar the two feature sets are. We'll store these values in a distance matrix.
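As a sketch of what DTW does under the hood (librosa's `dtw` works on full MFCC matrices with a configurable cost metric, but the core recurrence is the same), here's a minimal version for 1-D sequences:

```python
import numpy as np

def dtw_distance(a, b):
    # Minimal dynamic time warping between two 1-D sequences.
    # D[i, j] = minimum cost of aligning a[:i] with b[:j]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Each step may advance in a, in b, or in both
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-stretched copy of the same shape aligns perfectly...
print(dtw_distance([0, 1, 2, 1, 0], [0, 1, 1, 2, 2, 1, 0]))  # 0.0
# ...while a genuinely different shape does not
print(dtw_distance([0, 1, 2, 1, 0], [2, 2, 2, 2, 2]) > 0)  # True
```

This warping is exactly what lets us compare a short "tss" against a long one without penalizing the difference in duration.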
# Create a distance matrix measuring the distance between each pair of beats' MFCCs
distance_matrix = np.full((len(mfccs), len(mfccs)), -1.0)  # float, so distances aren't truncated
for i in range(len(mfccs)):
    for j in range(len(mfccs)):
        # If we are comparing a MFCC set to itself, distance is 0
        if i == j:
            distance_matrix[i][j] = 0
        # If we haven't calculated the distance already, calculate it
        elif distance_matrix[i][j] == -1:
            distance = librosa.sequence.dtw(mfccs[i], mfccs[j])[0][-1][-1]
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
# Plot
plot_utils.heatmap(distance_matrix, figsize=(5, 5), title='Distance matrix', xlabel='Beats', ylabel='Beats', show_yticks=True)
In our distance matrix here, a dark pixel at coordinate (x, y) indicates that beat x and beat y are similar. A yellow pixel indicates dissimilarity.
With this distance matrix, we can now use a clustering algorithm to group together similar beats. Let's try using hierarchical clustering.
import scipy.cluster
import scipy.spatial
# Create hierarchical cluster of beats
condensed_matrix = scipy.spatial.distance.squareform(distance_matrix)
cluster_data = scipy.cluster.hierarchy.linkage(condensed_matrix, 'complete')
max_distance = cluster_data[:,2][-1]
# Plot
plot_utils.plot([0], figsize=(15, 5), title='Hierarchical cluster', xlabel='Beats', ylabel='Cut position', show=False)
ax = plt.gca()
ax.set_yticks([0, max_distance])
ax.set_yticklabels([0, max_distance])
scipy.cluster.hierarchy.dendrogram(cluster_data, ax=ax)
plt.show()
The above 'dendrogram' is the result of our hierarchical clustering. It represents multiple clustering options; to get any single clustering, simply make a horizontal cut across the chart and observe which nodes are connected.
The question for us is: where should we make this horizontal cut? How many clusters should we choose?
One method of determining this is by looking at how the number of clusters changes as we change the position of our horizontal cut.
# Extract cluster distances from the cluster data
cluster_distances = cluster_data[::-1,2]
num_clusters = range(1, len(cluster_distances) + 1)
# Calculate the second differential line
second_diff = np.diff(cluster_distances, 2)
num_clusters_second_diff = num_clusters[1:-1]
cluster_quantity = np.argmax(second_diff) + num_clusters_second_diff[0]
# Plot
plot_utils.plot(
x=num_clusters,
y=cluster_distances,
title='Distance between cluster quantities (with second diff)',
xlabel='Number of clusters',
ylabel='Cut position',
show_yticks=True,
show=False
)
plt.plot(num_clusters[1:-1], second_diff, c='orange')
plt.axvline(cluster_quantity, c='black', ls=':')
plt.text(x=3.15, y=1100, s='<= Knee point at ' + str(cluster_quantity) + ' clusters', fontsize=14)
plt.legend(['Original', 'Second differential'], fontsize=14, framealpha=1)
plt.show()
The purple line in the above graph shows how the number of clusters changes as we move the position of the cut.
A popular method of determining a 'good' number of clusters is to look for the steepest slope change in this purple line. We can do this by locating the maximum value of the purple line's second differential (shown as the orange line).
This is known as a 'knee point'.
# For our chosen cluster quantity, retrieve the cluster pattern
pattern = scipy.cluster.hierarchy.fcluster(
cluster_data,
cluster_quantity,
'maxclust'
)
# Plot
multi_plot_utils.multi_plot(beats, title='', pattern=pattern)
Having chosen a cluster quantity, all that's left to do is determine which beats belong to which clusters. That's shown above.
And it worked great! Our only misclassification is beat #32, the unintentional knock noise.
We now know both where the beats are and what they represent. All that remains is to build a new track with similar sounding drums.
I've curated fifty drum kit sounds to choose from. For each beatbox sound, we'll run through these fifty drum sounds and determine which is the most similar using MFCCs and DTW.
DRUM_FOLDER = 'assets/drums/'
# Get all drum kit file paths
drum_files = os.listdir(DRUM_FOLDER)
# Extract MFCCs from each drum kit sound
drum_sounds = []
drum_mfccs = []
for file in drum_files:
    # Load at librosa's default sample rate (the same rate as the beatbox track);
    # use a new variable name so we don't clobber the beatbox track's audio
    drum_audio, _ = librosa.load(DRUM_FOLDER + file)
    mfcc = librosa.feature.mfcc(
        y=drum_audio,
        sr=sample_rate,
        win_length=WINDOW_SIZE,
        hop_length=WINDOW_OVERLAP,
        n_fft=1024
    )
    drum_sounds.append(drum_audio)
    drum_mfccs.append(mfcc)
# Create a map from beatbox sound classifications to drum kit sounds
beatbox_to_drum_map = {}
# Keep track of which drum sounds we've used to avoid repeats
drums_used = []
for beat_mfcc, classification in zip(mfccs, pattern):
    if classification not in beatbox_to_drum_map:
        # Find the drum kit sound most similar to the beatbox sound
        best_distance = float('inf')
        best_drum_sound = None
        best_drum_file = None
        for drum_sound, drum_mfcc, drum_file in zip(drum_sounds, drum_mfccs, drum_files):
            # No repeats allowed
            if drum_file in drums_used:
                continue
            # Calculate distance using previous DTW approach
            distance = librosa.sequence.dtw(beat_mfcc, drum_mfcc)[0][-1][-1]
            if distance < best_distance:
                best_distance = distance
                best_drum_sound = drum_sound
                best_drum_file = drum_file
        # Update drum map
        beatbox_to_drum_map[classification] = best_drum_sound
        drums_used.append(best_drum_file)
# Plot
drum_sounds = beatbox_to_drum_map.values()
for index, (drum_sound, spelling) in enumerate(zip(drum_sounds, ["pft", "tss", "khh"])):
    plot_utils.plot(
        drum_sound,
        title='Drum #' + str(index + 1) + ' (' + spelling + ')',
        xlabel='Time (samples)',
        ylabel='Amplitude',
        figsize=(20, 2)
    )
    IPython.display.display(IPython.display.Audio(data=drum_sound, rate=sample_rate))
Great, we have our three drum sounds. Let's build the final output.
# Create 8 seconds of empty audio
track = np.zeros(sample_rate * 8)
# Add drum sounds to the track
for beat_start, classification in zip(beat_starts, pattern):
    drum_sound = beatbox_to_drum_map[classification]
    drum_sound = np.concatenate([
        np.zeros(beat_start),
        drum_sound,
        np.zeros(len(track) - len(drum_sound) - beat_start)
    ])
    track += drum_sound
# Plot
plot_utils.plot(track, title='Drum kit audio', xlabel='Time (samples)', ylabel='Amplitude')
print('Final product!')
IPython.display.display(IPython.display.Audio(data=track, rate=sample_rate))
print('And the original again:')
IPython.display.display(IPython.display.Audio(BEATBOX_AUDIO_FILE))
This project has been fun, and the Beatbot algorithm works decently well!
A fine-tuned version could potentially be useful as a tool in music production for amateurs or professionals looking to make quick beat mock-ups.
To achieve a Beatbot algorithm that works even better, we'd probably want to use more recent, cutting-edge speech processing techniques such as convolutional neural networks (CNNs).
Given an enormous unlabeled set of beatbox audio data (like 10,000+ hours), we could create a better featurization model than MFCCs by using a strategy like wav2vec. Wav2vec is a project that performs unsupervised training on a CNN with massive amounts of audio data, creating a cutting-edge audio featurization model.
- Guide to MFCCs: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
- Explanation of DTW: https://towardsdatascience.com/dynamic-time-warping-3933f25fcdd
- Explanation of 'knee points': https://stackoverflow.com/a/21723473/4762063
- Wav2Vec: https://arxiv.org/abs/1904.05862